Add 'From Zero to Zarr' beginner guide to the Zarr data model by chuckwondo · Pull Request #4077 · zarr-developers/zarr-python

chuckwondo · 2026-06-17T23:14:32Z

Adds a new user-guide page (docs/user-guide/data_model.md, nav label "Understanding Zarr") that explains the Zarr data model for newcomers: why Zarr exists (its parallel-computing origin in genomics), then arrays, chunking and the chunk grid, stores as key->bytes maps, metadata (zarr.json), the specification, codecs, sharding, groups, and N-D arrays, ending with a runnable round-trip example and a cross-language note. Prose + diagrams throughout, with executable, build-verified code in the final section, and every spec detail linked to its section of the Zarr v3 spec.

Enables Mermaid diagrams via a pymdownx.superfences custom fence, and adds the page to the User Guide nav.

Closes #4056

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

Adds a new user-guide page (docs/user-guide/data_model.md, nav label "Understanding Zarr") that explains the Zarr data model for newcomers: why Zarr exists (its parallel-computing origin in genomics), then arrays, chunking and the chunk grid, stores as key->bytes maps, metadata (zarr.json), the specification, codecs, sharding, groups, and N-D arrays, ending with a runnable round-trip example and a cross-language note. Prose + diagrams throughout, with executable, build-verified code in the final section, and every spec detail linked to its section of the Zarr v3 spec. Enables Mermaid diagrams via a pymdownx.superfences custom fence, and adds the page to the User Guide nav. Closes zarr-developers#4056 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

d-v-b · 2026-06-18T12:57:41Z

+
+Chunking is the key move. Each chunk can be stored, loaded, and compressed on its
+own, so a program can read just the chunks it needs — that one corner your
+colleague wanted — without touching the rest. (Starting with a chunk shape that


could we use an inline admonition for the partial-chunk callout? something like

Note
If each chunk has a fixed size, how can we use chunks to represent an array that isn't evenly divided by the chunk size? See #section for the answer to that question!

not sure if note is the right admonition here

d-v-b · 2026-06-24T20:31:53Z

@mkitti if you have time it would be good to get your thoughts on this

d-v-b · 2026-06-24T20:34:07Z

+  G11 --> K11
+```
+
+Where does a key like `c/0/1` come from? It's built by a simple, fixed rule (the
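To make the origin of a key like `c/0/1` concrete, here is a minimal sketch of how a chunk key can be derived from chunk grid coordinates. It assumes the two chunk key encodings named in the Zarr v3 core spec (`default` and `v2`), selected by the `chunk_key_encoding` field of array metadata. The function `chunk_key` is a hypothetical helper for illustration, not a zarr-python API.

```python
def chunk_key(coords, encoding="default", separator=None):
    """Build the store key for a chunk from its grid coordinates.

    Hypothetical helper that mirrors the `chunk_key_encoding` field of
    array metadata; it is not part of zarr-python.
    """
    parts = [str(c) for c in coords]
    if encoding == "default":
        sep = "/" if separator is None else separator
        # Keys carry a "c" prefix; a 0-dimensional array yields just "c".
        return sep.join(["c", *parts])
    if encoding == "v2":
        sep = "." if separator is None else separator
        # No prefix; a 0-dimensional array yields "0".
        return sep.join(parts) if parts else "0"
    raise ValueError(f"unknown chunk key encoding: {encoding!r}")


print(chunk_key((0, 1)))                 # c/0/1
print(chunk_key((0, 1), encoding="v2"))  # 0.1
```

The same chunk gets a different key under each encoding, so the rule is fixed per array rather than fixed for all arrays.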


"fixed rule" locally implies that arrays have 1 chunk key encoding. maybe we can rephrase to make it clear that there's a rule defined by a particular field in array metadata.

codecov · 2026-06-24T20:35:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.50%. Comparing base (22818d9) to head (2b69319).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4077   +/-   ##
=======================================
  Coverage   93.50%   93.50%           
=======================================
  Files          90       90           
  Lines       11979    11979           
=======================================
  Hits        11201    11201           
  Misses        778      778

maxrjones

This is awesome @chuckwondo! I just have some small nits

maxrjones · 2026-06-25T14:09:06Z

+the *how* one idea at a time, until you understand **how Zarr stores an array**,
+**why** that layout is defined by a written specification, and **how a library
+turns those stored bytes back into an array you can use**.


Suggested change

-the *how* one idea at a time, until you understand **how Zarr stores an array**,
-**why** that layout is defined by a written specification, and **how a library
-turns those stored bytes back into an array you can use**.
+the *how* one idea at a time, until you understand **how Zarr stores an array**,
+**why that layout is defined by a written specification**, and **how a library
+turns those stored bytes back into an array you can use**.

nit about consistent use of bold text

maxrjones · 2026-06-25T14:10:00Z

+
+---
+
+## Why we need Zarr


If possible, it would be nice to have a tl;dr (maybe a note admonition) at the top of this section

maxrjones · 2026-06-25T14:10:59Z

+extraordinary firehoses of numbers. A satellite streams images of the Earth; a
+microscope captures gigapixel scans; a gene sequencer reads thousands of genomes;
+a climate model writes out temperature and wind for every point on the globe, hour
+after hour. In each case the result has the same shape: a vast grid of numbers —


many people have a negative reaction to em-dashes since their proliferation by AI. It would likely be worth reducing their use in this guide via more, shorter sentences.

maxrjones · 2026-06-25T14:13:25Z

+why, it helps to understand two things the array formats of the day were already
+doing.
+
+First, **chunking**. To store an array bigger than memory, formats like HDF5 and


It might be helpful to link to a glossary via hovertools (e.g., approach in https://github.com/developmentseed/datacube-guide/pull/35/changes#diff-98d0f806abc9af24e6a7c545d3d77e8f9ad57643e27211d7a7b896113e420ed2).

maxrjones · 2026-06-25T14:14:53Z

+(the [*Anopheles gambiae* 1000 Genomes Project](https://www.malariagen.net/)) —
+arrays far too big to fit in memory. His real frustration was *speed*, and to see
+why, it helps to understand two things the array formats of the day were already
+doing.


Suggested change

-doing.
+doing: chunking and compression.

If I'm reading this right, it's not totally obvious what "Second" is

maxrjones · 2026-06-25T18:49:53Z

+
+So a 5×6 array chunked at `(2, 3)` quietly stores a row of "phantom" cells holding
+the fill value. It's harmless, but it's a small waste — and a good reason to pick a
+chunk shape that fits your array's real shape reasonably well. (For practical


Suggested change

-chunk shape that fits your array's real shape reasonably well. (For practical
+chunk shape that fits your array's real shape reasonably well and lean on the
+[rectilinear chunk grid extension](https://github.com/zarr-developers/zarr-extensions/tree/main/chunk-grids/rectilinear) when needed. (For practical
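The arithmetic behind the "phantom" cells can be checked with a short sketch. It assumes a regular chunk grid in which edge chunks are stored at the full chunk shape; `padding_report` is a hypothetical helper name, not a zarr-python function.

```python
import math


def padding_report(shape, chunks):
    """Report the chunk grid, padded extent, and number of padded cells."""
    # Number of chunks along each axis, rounding up to cover the edges.
    grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
    # Extent actually covered by whole chunks.
    padded = tuple(g * c for g, c in zip(grid, chunks))
    phantom = math.prod(padded) - math.prod(shape)
    return {"grid": grid, "padded_shape": padded, "phantom_cells": phantom}


print(padding_report((5, 6), (2, 3)))
# {'grid': (3, 2), 'padded_shape': (6, 6), 'phantom_cells': 6}
```

For the 5×6 array chunked at `(2, 3)`, the six phantom cells are exactly one extra row of six columns.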

maxrjones · 2026-06-25T19:17:58Z

+[specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#codecs)
+defines three kinds of codec, applied in this order:
+
+1. **array → array** codecs (optional, any number) — rearrange the values; e.g. a


Suggested change

-1. **array → array** codecs (optional, any number) — rearrange the values; e.g. a
+1. **array → array** codecs (optional, any number) — rearrange or change the values; e.g. a

I believe this change is more accurate, but would appreciate if @d-v-b confirms
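As a rough illustration of the three codec kinds applied in order, the sketch below chains one stand-in for each using only the standard library: a transpose for array → array, `struct` packing for array → bytes, and `zlib` for bytes → bytes. These are assumptions chosen for illustration; they are not the codecs zarr-python ships, and the function names are hypothetical.

```python
import struct
import zlib


def transpose(rows):
    # array -> array: rearranges values without changing them.
    return [list(col) for col in zip(*rows)]


def to_bytes(rows):
    # array -> bytes: flatten row-major as little-endian 32-bit integers.
    flat = [v for row in rows for v in row]
    return struct.pack(f"<{len(flat)}i", *flat)


def from_bytes(buf, shape):
    n_rows, n_cols = shape
    flat = struct.unpack(f"<{n_rows * n_cols}i", buf)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


def encode(chunk):
    # Encoding order: array -> array, then array -> bytes, then bytes -> bytes.
    return zlib.compress(to_bytes(transpose(chunk)))


def decode(blob, shape):
    # Decoding undoes each step in reverse order.
    n_rows, n_cols = shape
    transposed = from_bytes(zlib.decompress(blob), (n_cols, n_rows))
    return transpose(transposed)


chunk = [[1, 2, 3], [4, 5, 6]]
blob = encode(chunk)
assert decode(blob, (2, 3)) == chunk
```

The transpose here only rearranges values. An array → array codec that changes values (for example, a scaling step) would slot into the same position in the chain.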

maxrjones · 2026-06-25T19:19:47Z

+simple, but it has a limit: small chunks in a very large array produce a *huge*
+number of chunks, and therefore a huge number of files or objects. The spec notes
+this is exactly where file systems (block sizes, inode limits) and object stores
+(which dislike millions of tiny objects) start to struggle.


I think the more prevalent limitation on object stores is the cost model, where the cost of operations often scales with the number of objects
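The scale of the problem is easy to estimate. The sketch below counts stored objects for a hypothetical array, first with one object per chunk and then with chunks grouped into larger shards. The shapes are made-up numbers, and `object_count` is an illustrative helper rather than a library function.

```python
import math


def object_count(shape, unit_shape):
    """Number of stored objects when each unit (chunk or shard) is one object."""
    return math.prod(math.ceil(s / u) for s, u in zip(shape, unit_shape))


shape = (100_000, 100_000)   # hypothetical array
chunks = (100, 100)          # small chunks, one object each
shards = (10_000, 10_000)    # each shard groups 100 x 100 chunks

print(object_count(shape, chunks))  # 1000000
print(object_count(shape, shards))  # 100
```

Whether the limit is felt through file-system overhead or through per-operation cost on an object store, both scale with the first number rather than the second.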

maxrjones · 2026-06-25T19:25:43Z

+or more axes.
+
+To see the generalisation concretely, picture a 3-D array as a **stack of 2-D
+arrays**. Here are two copies of our 4×6 grid stacked into a `(2, 4, 6)` array —


Suggested change

-arrays**. Here are two copies of our 4×6 grid stacked into a `(2, 4, 6)` array —
+arrays**. Here are two versions of our 4×6 grid stacked into a `(2, 4, 6)` array —

maxrjones · 2026-06-25T19:27:57Z

+- write it to the corresponding slice of the array,
+- discard it, and move on to the next block.
+
+Because only one block is ever in memory, the array on disk can be far larger than


Suggested change

-Because only one block is ever in memory, the array on disk can be far larger than
+Because the minimum amount of data ever needed in memory to be useful is a single block, the array on disk can be far larger than
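The block-at-a-time pattern under discussion can be sketched without any library. In this simplified stand-in, a plain dict plays the role of the store, blocks line up with chunks of a 1-D array, and the final partial block is stored as-is rather than padded. All names and values are hypothetical.

```python
def blocks(total, block_len):
    """Yield (start, values) one block at a time; the values are made up."""
    for start in range(0, total, block_len):
        stop = min(start + block_len, total)
        yield start, list(range(start, stop))


store = {}      # stand-in for a Zarr store: key -> bytes
block_len = 4

for start, block in blocks(10, block_len):
    key = f"c/{start // block_len}"
    store[key] = bytes(block)   # write the block to its slice of the array
    del block                   # discard it before producing the next block

print(sorted(store))  # ['c/0', 'c/1', 'c/2']
```

Only one block exists as a Python list at any moment, while the store keeps growing, which is the property both phrasings of the sentence are trying to capture.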

chuckwondo requested review from d-v-b and maxrjones June 17, 2026 23:14

Merge branch 'main' into docs/from-zero-to-zarr

3e1a62d

d-v-b reviewed Jun 18, 2026

Merge branch 'main' into docs/from-zero-to-zarr

2b69319

d-v-b reviewed Jun 24, 2026

maxrjones reviewed Jun 25, 2026


3 participants
